Flexible Text Segmentation with Structured Multilabel Classification

Authors

  • Ryan T. McDonald
  • Koby Crammer
  • Fernando Pereira
Abstract

Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguous segments. We evaluate the model on entity extraction and noun-phrase chunking and show that it is more accurate for overlapping and non-contiguous segments, but it still performs well on simpler data sets for which sequential tagging has been the best method.
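The representational point in the abstract can be illustrated with a small sketch. This is not the authors' model, only an assumed toy encoding: each token carries a set of segment labels instead of a single tag, so a token may belong to several segments and a segment may skip tokens. The sentence, the segment ids, and the helper names `segments_to_token_labels` and `bio_representable` are hypothetical.

```python
from typing import Dict, List, Set


def segments_to_token_labels(
    n_tokens: int, segments: Dict[str, List[int]]
) -> List[Set[str]]:
    """Map each token index to the set of segment ids that cover it."""
    labels: List[Set[str]] = [set() for _ in range(n_tokens)]
    for seg_id, positions in segments.items():
        for pos in positions:
            labels[pos].add(seg_id)
    return labels


def bio_representable(segments: Dict[str, List[int]]) -> bool:
    """Return True if every segment is contiguous and no two segments overlap,
    i.e. one tag per token (as in sequential tagging) would be enough."""
    seen: Set[int] = set()
    for positions in segments.values():
        ordered = sorted(positions)
        if not ordered:
            continue
        if ordered != list(range(ordered[0], ordered[-1] + 1)):
            return False  # segment skips a token: non-contiguous
        if seen & set(ordered):
            return False  # token already claimed: overlapping
        seen |= set(ordered)
    return True


# Hypothetical example: two person mentions sharing the surname token.
tokens = ["Bill", "and", "Hillary", "Clinton"]
segments = {"person_1": [0, 3], "person_2": [2, 3]}

labels = segments_to_token_labels(len(tokens), segments)
for token, token_labels in zip(tokens, labels):
    print(token, sorted(token_labels))
print("one tag per token suffices:", bio_representable(segments))
```

In this toy example the token "Clinton" belongs to both segments and "person_1" skips two tokens, which a single tag per token cannot express.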

Similar resources

Database-Text Alignment via Structured Multilabel Classification

This paper addresses the task of aligning a database with a corresponding text. The goal is to link individual database entries with sentences that verbalize the same information. By providing explicit semantics-to-text links, these alignments can aid the training of natural language generation and information extraction systems. Beyond these pragmatic benefits, the alignment problem is appeali...

Multilabel Classification through Structured Output Learning - Methods and Applications

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi
Author: Hongyu Su
Name of the doctoral dissertation: Multilabel Classification through Structured Output Learning - Methods and Applications
Publisher: School of Science
Unit: Department of Computer Science
Series: Aalto University publication series DOCTORAL DISSERTATIONS 28/2015
Field of research: Information and Computer Science
Manuscrip...

Diagnosis Code Prediction from Electronic Health Records as Multilabel Text Classification: A Survey

This article presents a survey on diagnosis code prediction from various information in Electronic Health Records (EHR): both unstructured free text and structured data. Particularly, our interests are in casting the problem as text classification with multiple sources and using neural network based models. We will first present previous work in this area and describe some simple baseline model...

Multi-Label Classification of Short Text: A Study on Wikipedia Barnstars

A content analysis of Wikipedia barnstars, personalized tokens of appreciation given to participants, reveals a wide range of valued work extending beyond simple editing to include social support, administrative actions, and types of articulation work. Barnstars are examples of short semi-structured text characterized by informal grammar and language. We propose a method to classify these barnsta...

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing it, and detecting and correcting image skew. In document segmentation, an algorith...

Publication date: 2005